Importing packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Preparing the data

  • Opening file
  • Normalising data
  • Dividing into train and test datasets
In [2]:
df = pd.read_csv("kc_house_data.csv").set_index('id')
df['date'] = df['date'].apply(lambda x: int(x.split('T')[0]))
df = df.astype('float')

norm_df = (df - df.mean()) / df.std()
label = norm_df.pop('price')

train, test, labels_train, labels_test = train_test_split(norm_df, label, train_size=0.80)

train.columns
Out[2]:
Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')
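Note that the z-score statistics above are computed on the full dataset before the split, so the test rows influence the scaling. A minimal sketch of the leakage-free variant, fitting the statistics on the training split only (the toy frame below is hypothetical, standing in for kc_house_data.csv):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for kc_house_data.csv
toy = pd.DataFrame({
    "sqft_living": [1180.0, 2570.0, 770.0, 1960.0, 1680.0, 5420.0],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
})

label = toy.pop("price")
train, test, labels_train, labels_test = train_test_split(
    toy, label, train_size=0.5, random_state=0
)

# Fit the z-score statistics on the training split only,
# then apply them to both splits, so no test information leaks in.
mu, sigma = train.mean(), train.std()
train_norm = (train - mu) / sigma
test_norm = (test - mu) / sigma
```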

Creating random forest and linear regression models

  • Fitting the model
  • Predicting test data
In [3]:
def classifier(model):
    # Fits a regressor on the training split, predicts the test split
    # and reports its errors. (The name is kept from the notebook; both
    # models here are regressors, not classifiers.)
    model.fit(train, labels_train)
    pred = model.predict(test)
    print(f"MSE: {mean_squared_error(labels_test, pred)}, MAE: {mean_absolute_error(labels_test, pred)}")
    return (model, pred)
In [4]:
model_rf, pred_rf = classifier(RandomForestRegressor(n_estimators=100, random_state=0))
model_lin, pred_lin = classifier(LinearRegression())
MSE: 0.12002199092384985, MAE: 0.18138663507142141
MSE: 0.26549235700713747, MAE: 0.33627265919543226
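Because the target was standardized, the errors above are measured in units of the price's standard deviation. A hedged sketch of converting a standardized RMSE back to the original scale (the `price_std` figure below is an assumption for illustration; the real value would come from `df['price'].std()`):

```python
import math

# Hypothetical inputs: the random forest's standardized MSE and an
# assumed price standard deviation (not computed from the dataset).
mse_norm = 0.12
price_std = 367_000.0  # assumption, for illustration only

rmse_norm = math.sqrt(mse_norm)       # error in standard deviations
rmse_dollars = rmse_norm * price_std  # error back in dollars
print(round(rmse_dollars))
```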

LIME decomposition of one observation

  • Taking single observations and the model's predictions for them
  • Generating LIME decompositions
In [5]:
import lime
import lime.lime_tabular

# LimeTabularExplainer expects numpy arrays, hence .values
explainer = lime.lime_tabular.LimeTabularExplainer(train.values,
                                                   mode='regression',
                                                   feature_names=train.columns,
                                                   discretize_continuous=False
                                                  )

def exp_rf_inst(i):
    exp = explainer.explain_instance(train.iloc[i].values, model_rf.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
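Under the hood, LIME fits a weighted linear surrogate around the instance: it perturbs the input, weights the perturbed samples by their proximity to the instance, and reads local feature importances off the surrogate's coefficients. A minimal self-contained sketch of that idea (using sklearn directly rather than the lime package, with a made-up black-box function):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# A made-up black box standing in for the random forest.
def black_box(X):
    return X[:, 0] ** 2 + 0.5 * X[:, 1]

x0 = np.array([1.0, 2.0])  # instance to explain

# 1. Perturb the instance with Gaussian noise.
X_pert = x0 + rng.normal(scale=0.5, size=(500, 2))
y_pert = black_box(X_pert)

# 2. Weight perturbations by proximity to x0 (RBF kernel).
dist2 = ((X_pert - x0) ** 2).sum(axis=1)
weights = np.exp(-dist2 / 0.5)

# 3. Fit a weighted linear surrogate; its coefficients are the
#    local explanation (here roughly 2 for x0 and 0.5 for x1).
surrogate = LinearRegression().fit(X_pert, y_pert, sample_weight=weights)
print(surrogate.coef_)
```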
In [6]:
exp_rf_inst(10)

For the 10th observation, the most positive variables are sqft_living, lat and grade, with weights of 0.27, 0.25 and 0.2 respectively. The first negative variable is long, at -0.09. sqft_living and lat therefore contribute almost exactly the same amount.

In [7]:
exp_rf_inst(100)

For the 100th observation the picture is similar: the most positive variables are again sqft_living, lat and grade, with weights of 0.27, 0.26 and 0.19 respectively, and the first negative variable is long, at -0.1. sqft_living and lat again contribute almost equally, and the decomposition is very close to the one for the 10th observation.

In [8]:
exp_rf_inst(1000)

For the 1000th observation the most positive variables are once more sqft_living, lat and grade, with weights of 0.28, 0.26 and 0.19 respectively, and the first negative variable is long, at -0.11. sqft_living and lat again carry nearly identical weights, matching the 100th observation.

Comparing the three observations, the explanations are stable: the four most important variables are nearly the same in all three decompositions, and the first negative variable is always long, with a similarly stable weight.

LIME comparison between linear regression and random forest models

In [9]:
def exp_lin_inst(i):
    # As above, LIME expects a numpy array, hence .values
    exp = explainer.explain_instance(train.iloc[i].values, model_lin.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
In [10]:
exp_lin_inst(10)

The five most important variables are grade, lat, sqft_living, sqft_above and yr_built, with weights of 0.3, 0.23, 0.21 and -0.21 respectively. Let's recall the explanation for the random forest model.

In [11]:
exp_rf_inst(10)

There are many differences between the two models' LIME decompositions for the 10th observation. The biggest is in sqft_above, which has a negligible negative weight in the random forest but a large one in the linear model.

The LIME explanations were very stable across observations within a single model, but differed substantially between the models.
